1. INTRODUCTION

In 2017, half of Pulau Pinang was submerged by floodwaters, on both the island and the
mainland. The floods caused extensive damage and, in some areas, fatalities. A suitable method for
long-term rainfall prediction is therefore needed. Such a method would help the authorities and
other relevant parties to be well prepared and to make plans to prevent these water-related
problems from happening.

In addition, rainfall forecasting is very important in agriculture, where it supports decision
making and strategic planning. The ability to forecast rainfall quantitatively can inform crop
planting decisions, reservoir water resource allocation, traffic control and the operation of sewer
systems, and can help in confronting water-related problems such as floods and droughts [1].

Previous researchers have shown increasing interest in developing time series models from
rainfall data. Several attempts have been made to forecast rainfall using various techniques and
methods that can produce a well-developed model. Forecasting has become very popular among
researchers owing to advances in data collection tools, computational methods and analysis
software. Methods that can be used for time series forecasting include Multiple Linear Regression,
Genetic Algorithm, Support Vector Machine (SVM) and Fuzzy Logic [2-3].

Many researchers have used Artificial Neural Networks (ANN) as the basis of model combination in
rainfall forecasting, in order to capture different patterns in the data. The parallel distributed
processing architecture of ANN has proved to be a very powerful computational tool and is now used
in several fields to model dynamic processes, including rainfall, successfully [4-5]. However,
although many researchers have compared the performance of Autoregressive Integrated Moving Average
(ARIMA) and ANN models, their results have not been consistent with one another.

Many past researchers have carried out extensive comparative studies of ARIMA and ANN models in
order to identify the most suitable model for forecasting rainfall data. ANN is a non-linear model
that has been widely used to solve forecasting problems, as identified by [6-11]. Previous studies
such as [12-13] performed a comparative analysis of ARIMA and ANN models to determine the
appropriate model for rainfall forecasting. They concluded that the ANN method was appropriate for
forecasting rainfall and outperformed the ARIMA method. Similarly, [14] found that ANN can
outperform ARIMA models and confirmed that ANN methods are capable of modelling the complex
rainfall-runoff relationship.

Artificial Neural Networks have been widely used in many different fields, such as digital
imaging [15], fault detection [16], gold price forecasting [17] and healthcare [18]. For instance,
[19] used a wavelet-based ANN model to forecast daily rainfall in Turkey and obtained close
estimates of the daily rainfall peaks. Similarly, a study conducted by [20] presented a successful
integration of wavelet and ANN for monthly rainfall prediction in India. Furthermore, [21] used a
Back-Propagation (BP) ANN to model hourly rainfall-runoff. The results show that the BP ANN
performed satisfactorily and proved to be a superior model, with an acceptable ability to find the
relationship between rainfall and runoff using only rainfall data and runoff data.

Studies by [22-23] compared forecasting models built with different techniques in order to
identify the best model for rainfall prediction. The authors found that the ANN approach was better
than the other models because it can capture the non-linear behaviour of rainfall. In addition,
[24-25] developed three monthly rainfall forecasting models using ARIMA, ANN and Multiple Linear
Regression (MLR) techniques. Their results showed that the neural network model produced more
accurate forecasts than the other two models.

A study by [26] proposed a dynamic recurrent time-delay neural network for monthly rainfall
forecasting in Queensland, Australia. The prototype network models had lower errors than the
forecasts generated by the standard models used by the Australian Bureau of Meteorology.
Moreover, three rainfall forecasting models were developed based on ARIMA, ANN and multiple linear
regression, with rainfall estimated on a monthly basis. The multilayer feed-forward
Back-Propagation (BP) neural network model was observed to forecast better than the other two
models.

The aim of this study is to determine which method provides more accurate forecasts of daily
rainfall data. The data are analyzed using two methods, ARIMA and ANN. The models obtained from
each method are evaluated and compared using the forecasting performance measures Mean Absolute
Error, Mean Forecast Error, Root Mean Square Error and the coefficient of determination. The model
with the highest accuracy is chosen as the best model for forecasting daily rainfall.


2. RESEARCH METHOD 

Several methods can be used to forecast rainfall data. In this study, two were applied: the
ARIMA model and the ANN model. First, the data must undergo pre-processing before any time series
method is applied. The pre-processing techniques are data normalization, data lagging and data
splitting.

Data normalization is a pre-processing method used to improve the precision of the model's
forecasts. In this study the data were normalized to the range [0,1], because the sigmoid
(logistic) activation function was used. Data lagging determines the inputs of the network, which
consists of input nodes, hidden layer nodes and output nodes: lagged observations of the series
form the input nodes. In this research, the input variables were constructed by trial and error.
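
The normalization and lagging steps described above can be sketched as follows. This is a minimal illustration, assuming min-max scaling is the normalization intended and that each input vector holds the preceding observations of the series; the function names, the lag count of 3 and the rainfall values are hypothetical and are not taken from the study.

```python
def min_max_normalize(series):
    """Scale values linearly into [0, 1] using the series minimum and maximum."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # A constant series carries no scale information; map it to zeros.
        return [0.0 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]


def make_lagged_samples(series, n_lags):
    """Build (inputs, target) pairs: the previous n_lags values predict the next one."""
    samples = []
    for t in range(n_lags, len(series)):
        samples.append((series[t - n_lags:t], series[t]))
    return samples


# Hypothetical daily rainfall totals in mm (illustrative values only).
rain_mm = [0.0, 12.5, 3.0, 0.0, 50.0, 7.5, 0.0, 25.0]
scaled = min_max_normalize(rain_mm)
samples = make_lagged_samples(scaled, n_lags=3)
```

Each element of `samples` pairs a window of lagged inputs with the value that follows it, so the number of lags directly sets the number of input nodes.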

The data need to be split into a training set and a testing set. The training set receives the
larger allocation, typically in a proportion of 90% to 10%, 80% to 20% or 70% to 30% [27]. In this
study, the data were split 80% to 20%, giving 876 training observations and 220 testing
observations. In addition, preliminary tests of the model assumptions were applied to check the
stationarity and normality of the data for forecasting purposes.
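
The 876 and 220 observations quoted above are consistent with an 80% cut of a three-year daily record of 1096 days (2016 is a leap year). The sketch below reproduces that arithmetic. It assumes a chronological split in which the earliest observations form the training set, which the text does not state explicitly; the function name and the placeholder series are hypothetical.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered series into an early training part and a late testing part."""
    n_train = int(len(series) * train_fraction)
    return series[:n_train], series[n_train:]


# Days in 2016 (leap year), 2017 and 2018.
n_days = 366 + 365 + 365
placeholder_series = list(range(n_days))  # stands in for the daily rainfall values
train, test = chronological_split(placeholder_series)
```

Keeping the split chronological avoids training on observations that come after the test period, which matters for time series forecasting.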


2.1. Study area 

For this analysis, daily rainfall data from station 5204048, Simpang Ampat, Pulau Pinang were
used as a case study. The data cover 3 years, from January 2016 until December 2018. They consist
of daily rainfall totals (in mm), the minimum, maximum and total rainfall per month, and the
annual rainfall. The daily rainfall totals were used as the variable in this research. The data
then underwent normalization, lagging and splitting. Figure 1 shows the map of the study area.


Figure 1. Map of the study area


2.2. ARIMA model 
The ARIMA model, introduced by Box and Jenkins, is one of the most popular forecasting methods
in research and practice. It is generally written as ARIMA (p,d,q), where p is the order of the
autoregressive (AR) component, d is the number of times the series has been differenced and q is
the order of the moving average (MA) component, all of which are non-negative integers [28].
The Box-Jenkins procedure involves model identification, parameter estimation and model
diagnostic checking [29].
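
For reference, one common way of writing the ARIMA (p,d,q) model uses the backshift operator $B$, where $B y_t = y_{t-1}$. The symbols $c$, $\phi_i$, $\theta_j$ and $\varepsilon_t$ are introduced here for illustration only, and the sign convention on the MA terms varies between textbooks and software.

```latex
\phi(B)\,(1 - B)^{d}\, y_t = c + \theta(B)\,\varepsilon_t,
\qquad
\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^{p},
\qquad
\theta(B) = 1 + \theta_1 B + \dots + \theta_q B^{q}
```

Here $y_t$ is the observed series, $\varepsilon_t$ is a white-noise error term and $(1 - B)^{d}$ applies $d$ rounds of differencing.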

The first step in model identification is to determine whether the time series is stationary.
If the data are not stationary, non-seasonal differencing can be applied to make them stationary.
The model can then be identified from the autocorrelation function (ACF) plot and the partial
autocorrelation function (PACF) plot. The guidelines for model identification are shown in
Table 1.


Table 1. Guideline for ARIMA model identification 


ACF                     PACF                    Model
Dies down               Cuts off after lag p    AR(p)
Cuts off after lag q    Dies down               MA(q)
Dies down               Dies down               ARMA(p,q)
Cuts off after lag q    Cuts off after lag p    AR(p) or MA(q)
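
The differencing step and the sample ACF used in identification can be sketched in a few lines. This is an illustrative sketch only: the function names and the toy series are hypothetical, the ACF uses the usual biased estimator (each lag divided by the lag-0 sum of squares), and the PACF is not computed here.

```python
def difference(series, d=1):
    """Apply non-seasonal differencing d times."""
    out = list(series)
    for _ in range(d):
        out = [out[i] - out[i - 1] for i in range(1, len(out))]
    return out


def sample_acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(x * x for x in dev)
    if c0 == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]


toy_series = [1, 3, 6, 10, 15, 21]        # illustrative trending series
once = difference(toy_series)             # [2, 3, 4, 5, 6]
twice = difference(toy_series, d=2)       # [1, 1, 1, 1]
acf = sample_acf(once, max_lag=2)
```

In practice the ACF values would be plotted against the lag and read against the guidelines in Table 1.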


After the models have been identified, the constant and coefficients of the equation must be
estimated. In model estimation, the parameters of the selected tentative model are estimated.
Parameters judged to be significantly different from zero are retained in the fitted model, while
parameters that are not significant are dropped.

Finally, the adequacy of the model must be checked in the diagnostic checking step. The
Box-Ljung test is used to test for lack of fit of a time series model, that is, whether the
residuals are still correlated after an ARIMA model has been fitted to the data. The model with
the smallest p-values for the estimated parameters and the highest p-value for the Box-Ljung test
was chosen. The model was also selected using the Akaike Information Criterion (AIC): the model
with the smallest AIC value is chosen as the adequate model.
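
For reference, the two selection quantities above are usually defined as follows. The notation is introduced here for illustration and does not come from the study: $m$ is the number of estimated parameters, $\hat{L}$ the maximized likelihood, $n$ the number of residuals, $h$ the number of lags tested and $\hat{r}_k$ the sample autocorrelation of the residuals at lag $k$.

```latex
\mathrm{AIC} = 2m - 2\ln\hat{L},
\qquad
Q = n\,(n + 2)\sum_{k=1}^{h}\frac{\hat{r}_k^{2}}{n - k}
```

Under the null hypothesis of uncorrelated residuals, $Q$ approximately follows a chi-square distribution, with $h - p - q$ degrees of freedom when applied to the residuals of an ARIMA (p,d,q) fit, so a large p-value gives no evidence of lack of fit.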


2.3. ANN model 

ANN is another method that can be applied in time series analysis and is widely used by
researchers. Several steps are required to build a neural network forecasting model: choosing the
network architecture, the learning algorithm and the activation functions. Data normalization is
usually performed before the ANN model is trained. The input data need to be normalized according
to the activation function used, which helps to minimize the error of the model.

In this study, the Multilayer Perceptron (MLP) structure consists of three layers: an input
layer, a hidden layer and an output layer. A total of 35 MLP network models were developed using
the daily rainfall data. The input nodes were determined by the data lagging technique, which was
applied to evaluate the forecasting performance and capability of the models in detail. The lagged
observations were obtained by trial and error over the input variables. The number of hidden layer
nodes ranged from 2 to 10, based on previous studies and on trial and error. The output layer has
only one node.

For modelling the daily rainfall data, the network applied was an MLP containing input, hidden
and output layers. The models were trained with the gradient descent back-propagation algorithm.
This algorithm has two parameters, the learning rate (lr) and the momentum coefficient (mc), which
were determined from previous studies and by trial and error. The neural network model was trained
with an lr of 0.3, an mc of 0.2 and 1000 training epochs.

Two activation functions were needed to link the neurons. In this study, the sigmoid (logistic)
activation function was used for the hidden layer; it keeps the outputs of the hidden layer within
the range 0 to 1. A linear activation function was used at the output layer: since only one result
is generated there, the use of a linear function is acceptable. The two equations are as follows:


$$f(x) = \mathrm{purelin}(x) = x \tag{1}$$


$$f(x) = \mathrm{logsig}(x) = \frac{1}{1 + e^{-x}} \tag{2}$$


where x is the input value. 
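
To make the roles of the two activation functions and of the lr and mc parameters concrete, the following is a minimal sketch of a one-hidden-layer network trained by gradient descent with momentum. It is not the implementation used in the study. The class name `TinyMLP`, the weight initialization range, the per-sample (online) updates, the toy samples and the update rule `step = -lr * gradient + mc * previous_step` are all assumptions; the momentum rule is one common textbook form and other toolboxes scale the terms differently.

```python
import math
import random


def logsig(x):
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def purelin(x):
    """Linear (identity) activation."""
    return x


class TinyMLP:
    """One hidden layer (logsig) and a single linear output node."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each hidden row holds n_inputs weights followed by a bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        # Output weights: one per hidden node, followed by a bias.
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        # Previous weight changes, needed for the momentum term.
        self.prev_hidden = [[0.0] * (n_inputs + 1) for _ in range(n_hidden)]
        self.prev_out = [0.0] * (n_hidden + 1)

    def forward(self, x):
        hidden = [logsig(sum(w * v for w, v in zip(row[:-1], x)) + row[-1])
                  for row in self.w_hidden]
        output = purelin(sum(w * h for w, h in zip(self.w_out[:-1], hidden))
                         + self.w_out[-1])
        return hidden, output

    def train_step(self, x, target, lr=0.3, mc=0.2):
        """One back-propagation update on a single sample; returns the squared-error loss."""
        hidden, output = self.forward(x)
        error = output - target  # gradient of 0.5 * error^2 w.r.t. the output
        grad_out = [error * h for h in hidden] + [error]
        # Back-propagate through logsig, whose derivative is h * (1 - h).
        deltas = [error * self.w_out[j] * h * (1.0 - h)
                  for j, h in enumerate(hidden)]
        for j, delta in enumerate(deltas):
            grads = [delta * v for v in x] + [delta]
            for i, g in enumerate(grads):
                step = -lr * g + mc * self.prev_hidden[j][i]
                self.w_hidden[j][i] += step
                self.prev_hidden[j][i] = step
        for j, g in enumerate(grad_out):
            step = -lr * g + mc * self.prev_out[j]
            self.w_out[j] += step
            self.prev_out[j] = step
        return 0.5 * error * error


# Hypothetical normalized samples: six lagged inputs and the next-day target.
samples = [
    ([0.1, 0.4, 0.0, 0.9, 0.3, 0.2], 0.8),
    ([0.4, 0.0, 0.9, 0.3, 0.2, 0.8], 0.1),
    ([0.0, 0.9, 0.3, 0.2, 0.8, 0.1], 0.5),
]
net = TinyMLP(n_inputs=6, n_hidden=4)
for epoch in range(1000):
    for x, target in samples:
        net.train_step(x, target)
```

The six inputs, four hidden nodes and single output mirror the (6,4,1) notation used for the selected network, and the defaults `lr=0.3`, `mc=0.2` and 1000 epochs follow the values quoted in the text.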


2.4. Forecasting performance measurement 

The performance of the models is evaluated by calculating the difference between the observed
rainfall data and the rainfall data generated by the model. According to [30-32], several
performance evaluation methods can be used for hydrological forecasting models. In this study,
forecasting performance is evaluated using Mean Absolute Error (MAE), Mean Forecast Error (MFE),
Root Mean Square Error (RMSE) and the coefficient of determination (R²). The forecasting model
with the smallest values of MAE, MFE and RMSE was appointed as the best model for forecasting.
In addition, the value of R², which lies between 0 and 1, shows how well the model fits the data.
The formulas are shown below:


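$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right| \tag{3}$$


$$\mathrm{MFE} = \frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right) \tag{4}$$


$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^{2}} \tag{5}$$


$$R = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}} \tag{6}$$


$$R^{2} = \left(\frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}\right)^{2} \tag{7}$$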


where $y_t$ is the observed value at period t; $\hat{y}_t$ is the predicted value at period t; n is
the number of periods used in the calculation; R is the correlation coefficient; R² is the
coefficient of determination; in equations (6) and (7), n is the number of pairs of data, x is the
observed value of rainfall data and y is the predicted value of rainfall data.
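
The four measures can be computed directly from paired observed and predicted values. The sketch below is illustrative: the function names and the four data points are hypothetical, the forecast error is taken as observed minus predicted so that a positive MFE means under-forecasting, and R² is computed as the squared correlation coefficient.

```python
import math


def mae(observed, predicted):
    """Mean Absolute Error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mfe(observed, predicted):
    """Mean Forecast Error; positive values indicate under-forecasting."""
    return sum(o - p for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root Mean Square Error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted))
                     / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination as the squared Pearson correlation."""
    n = len(observed)
    sx, sy = sum(observed), sum(predicted)
    sxy = sum(o * p for o, p in zip(observed, predicted))
    sxx = sum(o * o for o in observed)
    syy = sum(p * p for p in predicted)
    r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return r * r


# Hypothetical observed and predicted rainfall in mm (illustrative values only).
observed = [0.0, 10.0, 20.0, 5.0]
predicted = [1.0, 8.0, 18.0, 5.0]

print(mae(observed, predicted))    # 1.25
print(mfe(observed, predicted))    # 0.75
print(rmse(observed, predicted))   # 1.5
print(r_squared(observed, predicted))
```

In this toy case the positive MFE of 0.75 shows that the predictions are, on average, below the observations.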


3. RESULTS AND DISCUSSION 

The daily rainfall data sets were successfully forecasted using both ARIMA and ANN models.
For ARIMA (3,1,1), the p-value of each parameter is significant, that is, less than the
significance level of α = 0.05. Moreover, ARIMA (3,1,1) has the smallest AIC value, 11060.89, and
the largest Box-Ljung p-value, 0.2313, and was therefore selected as the best ARIMA model for
forecasting daily rainfall in Simpang Ampat, Pulau Pinang.

Furthermore, the ANN model is capable of predicting efficiently because it performs non-linear
modelling of the rainfall data. The ANN structure consists of seven input nodes, two to ten hidden
layer nodes and one output node. A feed-forward back-propagation neural network was developed and
trained with the gradient descent back-propagation algorithm, using the sigmoid (logistic)
activation function for the hidden layer and a linear activation function for the output layer.
The ANN (6,4,1) model was trained and tested.

Next, forecasting accuracy was measured from the observed and predicted values of the models.
The ARIMA and ANN models were evaluated and compared to see which method provides the most
appropriate tool for forecasting daily rainfall data. The accuracy results differ between the
ARIMA and ANN models. The level of accuracy is measured using the MAE, MFE, RMSE and R² criteria.

From the analysis results, both models are capable of serving as forecasting tools for daily
rainfall data. The model with the smallest MAE, MFE and RMSE and the highest R² was appointed as
the best forecasting model. Table 2 shows the comparison of the performance measures for the ARIMA
and ANN models. Based on the results, the ANN model outperforms the ARIMA model. The MAE of the
ARIMA model is 78.0877 for the training set and 102.7644 for the testing set. The ANN model has a
lower MAE in both sets, at 25.6573 for training and 8.4208 for testing.

For MFE, a value larger than 0 indicates that the model under-forecasts, while a value less than
0 indicates that it over-forecasts. The MFE values for the training and testing sets of the ARIMA
model were 2.9465 and -3.3436, which indicates that the model under-forecasts during training and
tends to over-forecast in the testing set. In other words, the ARIMA forecasts are low relative to
the observed rainfall in the training set and high relative to the observed rainfall in the
testing set. The ANN model, however, under-forecasts in both sets, with an MFE of 13.1511 in
training and 2.2188 in testing, which means its forecasts are low relative to the observed
rainfall.

The two models were also compared on RMSE, which measures the variation between the observed
values and the predicted values. The smaller the RMSE, the more accurate the forecast. The RMSE of
the ANN model is lower than that of the ARIMA model in both the training set (57.2931 versus
131.1353) and the testing set (34.6740 versus 138.9895). This means that the ANN model provides
more accurate forecasts of daily rainfall data.


The coefficient of determination, R², can also be measured and compared. It gives the proportion
of the variation in one variable that is predictable from the other variable. The value of R² lies
between 0 and 1 and denotes the strength of the linear association between x and y. The closer its
value is to 1, the better the fit or relationship between the variables. For the ANN model, R² is
high for both the training and testing sets, at 0.8227 and 0.9432, compared with 0.1231 and 0.0144
for the ARIMA model. The ANN model therefore shows a better fit and a stronger positive
relationship between the variables than the ARIMA model. Hence, the ANN model outperforms the
ARIMA model and is shown to be able to model and forecast daily rainfall data. Figure 2 displays a
comparison of the ARIMA model and the ANN model.


Table 2. Comparison of the performance measure of selected ARIMA and ANN models 


                Error Measure for Daily Rainfall Data in Training Set   Error Measure for Daily Rainfall Data in Testing Set
Method          MAE           MFE           RMSE          R²            MAE           MFE           RMSE          R²
ARIMA (3,1,1)   78.0877       2.9465        131.1353      0.1231        102.7644      -3.3436       138.9895      0.0144
ANN (6,4,1)     25.6573       13.1511       57.2931       0.8227        8.4208        2.2188        34.6740       0.9432


Notes: Bold is the smallest value.


Figure 2. Comparison of ARIMA model vs ANN model


4. CONCLUSION 

Based on the results obtained, the two models were compared and evaluated in order to find the
best model for forecasting rainfall data. The purpose of the comparison was to find the most
suitable method for forecasting daily rainfall data, namely the one with the smallest error
measures. Three error measures were used to evaluate the accuracy of the models: MAE, MFE and
RMSE. The smaller the error, the more accurate the forecasts of the model. Moreover, the
coefficient of determination, R², was measured to assess the fit or relationship between the
variables. The model with the lowest error and the R² value closest to 1 is selected as the best
model.

In the testing set, the ANN (6,4,1) model has an MFE of 2.2188, which is smaller in magnitude
than that of ARIMA (3,1,1). This shows that the ANN model under-forecasts: its forecasts are lower
than the observed daily rainfall by 2.2188 mm on average. The MFE for ARIMA (3,1,1) is -3.3436 mm
on average, which means that the ARIMA forecasts are higher than the observed rainfall, so the
model tends to over-forecast. Thus, the ANN model was chosen as the best model, as it gives lower
MAE, MFE and RMSE than ARIMA. In addition, it has a higher R² value, which indicates that the
model fits the data much better.